Is Cache Oblivious DGEMM a Viable Alternative?
Authors
Abstract
We present an in-depth study of various implementations of DGEMM, using both the recursive and iterative programming styles. Recursive algorithms for DGEMM are usually cache-oblivious, and they automatically block DGEMM’s operands A, B, C for the memory hierarchy. Iterative algorithms for DGEMM explicitly block A, B, C for the L1 cache, higher caches and memory. Our study shows that recursive DGEMM implementations cannot achieve the high performance of blocked iterative algorithms.

1 A Study of Recursive and Iterative Algorithms for DGEMM

The performance of DGEMM on modern computers is limited by the performance of the memory system in two ways. First, the latency of memory accesses can be many hundreds of cycles, so the processor may be stalled most of the time, waiting for reads to complete. Second, the bandwidth from memory is usually far less than the rate at which the processor can consume data. This contribution examines these limitations in depth for both recursive and iterative programming styles.

We describe the results of a study of the performance of highly optimized cache-oblivious and cache-conscious programs for DGEMM on four modern architectures: IBM Power 5, Sun UltraSPARC IIIi, Intel Itanium 2, and Intel Pentium 4 Xeon. These programs are generated by a domain-specific compiler we are building called BRILA (Block Recursive Implementation of Linear Algebra). The compiler takes recursive descriptions of linear algebra problems and produces optimized iterative or recursive programs as output.

Our main finding is that there is a significant gap between the performance of cache-oblivious and cache-conscious algorithms in this domain. We also provide insights into why this gap exists. We are not aware of any similar study in the literature.

First, we motivate approximate blocking by giving a quantitative analysis of how blocking can reduce the required bandwidth from memory. This analysis provides a novel way of looking at the memory hierarchy behavior of cache-oblivious programs.
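To make the two programming styles concrete, the following is a minimal Python sketch that contrasts a recursive (cache-oblivious) matrix multiply with an explicitly blocked iterative one. It is not BRILA output and not the authors' code. The function names, the `leaf` cutoff, and the block size `nb` are hypothetical choices made for illustration, and the recursive version assumes the matrix dimension is a power of two. Pure Python will not reproduce the performance effects studied in the paper; the sketch only shows how the control structure differs.

```python
# Illustrative sketch only: not BRILA output and not the authors' code.
# Matrices are square lists of lists; all names and sizes are hypothetical.

def matmul_recursive(A, B, C, n, leaf=2):
    """Cache-oblivious style: C += A * B by recursive quadrant splitting.

    Assumes n is a power of two and leaf >= 1, so every split yields
    equal halves and the recursion terminates.
    """
    assert n > 0 and n & (n - 1) == 0, "sketch assumes n is a power of two"
    assert leaf >= 1

    def rec(ai, aj, bi, bj, ci, cj, size):
        if size <= leaf:
            # Base case: plain triple loop on a small sub-block.
            for i in range(size):
                for j in range(size):
                    acc = C[ci + i][cj + j]
                    for k in range(size):
                        acc += A[ai + i][aj + k] * B[bi + k][bj + j]
                    C[ci + i][cj + j] = acc
            return
        h = size // 2
        # Eight half-size products; no cache parameter appears anywhere.
        for di in (0, h):
            for dj in (0, h):
                for dk in (0, h):
                    rec(ai + di, aj + dk, bi + dk, bj + dj, ci + di, cj + dj, h)

    rec(0, 0, 0, 0, 0, 0, n)


def matmul_blocked(A, B, C, n, nb=2):
    """Cache-conscious style: C += A * B with an explicit block size nb.

    In a tuned code nb would be chosen per cache level; here it is just
    a parameter of the sketch.
    """
    for i0 in range(0, n, nb):
        for j0 in range(0, n, nb):
            for k0 in range(0, n, nb):
                # Multiply one nb-by-nb block of A with one block of B.
                for i in range(i0, min(i0 + nb, n)):
                    for j in range(j0, min(j0 + nb, n)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + nb, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
```

Both functions accumulate into `C`, so the caller zeroes `C` first to obtain a plain product. The recursive version never names a cache size, which is what makes it cache-oblivious; the blocked version exposes `nb` as a tuning parameter, which is the cache-conscious approach.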
Similar resources
Funnel Heap - A Cache Oblivious Priority Queue
The cache oblivious model of computation is a two-level memory model with the assumption that the parameters of the model are unknown to the algorithms. A consequence of this assumption is that an algorithm efficient in the cache oblivious model is automatically efficient in a multi-level memory model. Arge et al. recently presented the first optimal cache oblivious priority queue, and demonstr...
Quickheaps: Simple, Efficient, and Cache-Oblivious
We present the Quickheap, a simple and efficient data structure for implementing priority queues in main and secondary memory. Quickheaps are comparable with classical binary heaps in simplicity, but are more cache-friendly. This makes them an excellent alternative for a secondary memory implementation. We show that the average amortized CPU cost per operation over a Quickheap of m elements is ...
Evaluation of DGEMM Implementation on Intel Xeon Phi Coprocessor
In this paper we will present a detailed study of implementing double-precision matrix-matrix multiplication (DGEMM) utilizing the Intel Xeon Phi Coprocessor. We discuss a DGEMM algorithm implementation running "natively" on the coprocessor, minimizing communication with the host CPU. We will run DGEMM across a range of matrix sizes natively as well as using Intel Math Kernel Library. Our optimiza...
Superscalar GEMM-based Level 3 BLAS - The On-going Evolution of a Portable and High-Performance Library
Recently, a first version of our GEMM-based level 3 BLAS for superscalar type processors was announced. A new feature is the inclusion of DGEMM itself. This DGEMM routine contains inline what we call a level 3 kernel routine, which is based on register blocking. Additionally, it features level 1 cache blocking and data copying of sub-matrix operands for the level 3 kernel. Our other BLAS's which ...
A Comparison of Cache Aware and Cache Oblivious Static Search Trees Using Program Instrumentation
An experimental comparison of cache aware and cache oblivious static search tree algorithms is presented. Both cache aware and cache oblivious algorithms outperform classic binary search on large data sets because of their better utilization of cache memory. Cache aware algorithms with implicit pointers perform best overall, but cache oblivious algorithms do almost as well and do not have to be...